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Abstract 



A simple analytical framework to study the molecular quasispecies evo- 
lution of finite populations is proposed, in which the population is assumed 
to be a random combination of the constituent molecules in each generation, 
i.e., linkage disequilibrium at the population level is neglected. In particular, 
for the single-sharp-peak replication landscape we investigate the dependence 
of the error threshold on the population size and find that the replication 
accuracy at the threshold increases linearly with the reciprocal of the popula- 
tion size for sufficiently large populations. Furthermore, in the deterministic 
limit our formulation yields the exact steady-state of the quasispecies model, 
indicating then that the population composition is a random combination of 

the molecules. 
87.10.-he, 64.60.Cn 
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I. INTRODUCTION 



An important issue in the investigation of the dynamics of competing self-reproducing 
macromolecules, whose paradigm is Eigen's quasispecies model is the effect of the finite 
size of the population on the error threshold phenomenon that limits the length of the 
molecules [|]. The quasispecies model was originally formulated as a deterministic kinetic 
theory described by a set of ordinary differential equations for the concentrations of the 
different types of molecules that compose the population. Such formulation, however, is valid 
only in the limit where the total number of molecules goes to infinity. More pointedly, 
in this model a molecule is represented by a string of v digits (si, S2, . . . , s,^), with the 
variables allowed to take on k different values, each of which representing a different type 
of monomer used to build the molecule. For sake of simplicity, in this paper we will consider 
only binary strings, i.e., = 0, 1. The concentrations Xi of molecules of type i = 1,2, ... ,2'^ 
evolve in time according to the following differential equations 

^ = E w.,x, - [A + $ m X, , (1) 

where the constants Di stand for the death probability of molecules of type i, and is a 
dilution flux that keeps the total concentration constant. This flux introduces a nonlinearity 
in (|I]), and is determined by the condition J2idxi/dt = 0. The elements of the replication 
matrix Wij depend on the replication rate or fitness Ai of the molecules of type i as well as 
on the Hamming distance d between strings i and j. They are given by 

Wu = Aq'' (2) 

and 

where < g < 1 is the single-digit replication accuracy, which is assumed to be the same for 
all digits. Henceforth we will set Di = for all i. The quasispecies concept is illustrated more 
neatly for the single-sharp-peak replication landscape, in which we ascribe the replication 
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rate a > 1 to the so-called master string (1,1,...,1), and the replication rate 1 to the 
remaining strings. In this context, the parameter a is termed selective advantage of the 
master string. As the error rate 1 — g increases, two distinct regimes are observed in the 
population composition: the quasispecies regime characterized by the master string and its 
close neighbors, and the uniform regime where the 2" strings appear in the same proportion. 
The transition between these regimes takes place at the error threshold 1 — qt, whose value 
depends on the parameters u and a A genuine thermodynamic order- disorder phase 

transition occurs in the limit z/ — oo only We must note, however, that standard 

statistical mechanics tools developed to study the surface equilibrium properties of lattice 
systems can be used to investigate the finite z/ case as well Moreover, the complete 

analytical solution of the single-sharp-peak replication landscape has been found recently 
by mapping the stationary solution of the kinetic equations ([l|) into a polymer localization 



Closely related to our approach to the quasispecies evolution of finite populations is the 
population genetics formulation of the deterministic quasispecies model proposed recently 



number of monomers 1 they have, regardless of the particular positions of these monomers 
inside the molecules. Hence there are only u+l different types of molecules which are labeled 
by the integer P = 0, 1, . . . , z/. This assumption is not so far-fetched since the feature that 
distinguishes the molecules is their replication rates Ai, which in most analyses have been 
chosen to depend on P only, i.e., = Ap [0. Furthermore, denoting the frequency of 
monomers 1 in generation t hj pt, it is assumed that the molecule frequencies Ilp{t) are 
given by the binomial distribution 



for P = 0, 1, . . . , z/. Thus, in each generation the monomers are sampled with replacement 
from a pool containing monomers 1 and in the proportions pt and 1—pt, respectively. This 
amounts to neglecting linkage disequilibrium, i.e., in each generation the molecule frequen- 



problem [^0. 



0. In that formulation it is assumed that the molecules are characterized solely by the 
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cies are random combinations of the constituent monomers [§. With the two assumptions 
presented above a simple recursion relation for the monomer frequency pt can be readily 
derived |^. 

To take into account the effect of finite A^, the deterministic kinetic formulation must 
be replaced by a stochastic formulation based on a master equation for the probability dis- 
tribution of the number of different types of molecules in the population However, 
the extreme approximations used to derive results from that master equation or from re- 
lated Langevin equations |Tl|JI^ have hindered the analysis of the error threshold for finite 
populations. An alternative approach to study stochastic chemical reaction networks is the 
algorithm proposed by Gillespie |T^, which has been successfully employed to simulate nu- 



merically the quasispecies model, providing thus a base line for analytical investigations 
\\14\\ . The goal of this work is to propose an analytical framework to investigate the quasis- 
pecies evolution of finite populations. More specifically, we will focus on the evolution of 
the molecule frequencies Ilp{t) for P = 0, . . . , z/ and, since for finite these frequencies are 
random variables, we will derive a recursion relation for the average values Ilp{t). Although 
we will concentrate mainly on the dependence of the error threshold on the population size 
A^, the formalism presented in the sequel can be applied to study a variety of fascinating 



phenomena related to the finitude of the population, such as mutational meltdown |]15l and 



punctuated equilibria or stasis [T^JT^, to mention only a few. Moreover, since modern the- 
ories of integration of information in pre-biotic systems involve the compartmentation of a 
small number of molecules (typically 10 to 100) [|l^, the understanding of the effects of the 
error propagation in finite populations has become an important issue to the theories of the 
origin of life. 

II. THE MODEL 

In each generation the population is described by the vector n = [uq, . . . , n^) where np 
is the number of molecules of type P, so that J^p^p = Similarly to the deterministic 
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case , we have to resort to a simplifying assumption to relate the molecule frequencies 11 p 



to the vector n. In particular, in generation t we consider a molecule pool containing the 
different molecule types in the proportions lip, so that n is distributed by the multinomial 
distribution 



Hence in each generation the molecules are sampled with replacement from the molecule 
pool. In this sense, in each generation the population is a random combination of the 
constituent molecules, which amounts to neglecting linkage disequilibrium at the population 
level. More pointedly, the population composed of the offspring of the molecules present in 
the generation t — 1 is destroyed and its molecule frequency T\.p{t) used to create an entire 
new population according to (||). Although this procedure destroys the correlations between 
the molecules, it does not cause any significant loss of genetic information since the fitness of 
the molecules depend only on the number of monomers 1 they have, which, in the average, 
is not affected by the procedure. 

The changes in the population composition n are due to the driving of natural selection, 
modeled by the replication rate Ap, and to mutations, modeled by the error rate per digit 1 — 
q. Following the prescription used in the implementation of the standard genetic algorithm 
|T8| , we consider first the effect of natural selection and then the effect of mutations. As 
usual we assume that the number of offspring that each molecule contributes to the new 
generation is proportional to its relative replication rate which, for molecules of type P, is 
defined by 



Thus the population composition after selection is described by the random vector n' = 
(rig, . . . ,n^) which is distributed according to the conditional probabihty distribution 






(6) 



V, (n' I n) 



(7) 
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Next we consider the changes in n' due to mutations. After mutation, the population is 
described by n" = (?t,q, . . . , n'D whose components are written as 

where the integer n"p^ stands for the number of molecules of type R that have mutated to a 
molecule of type P. Clearly, n'j^ = Y,p n"p^. We note that the probability of mutation from 
a molecule of type i? to a molecule of type P is given by 



M, 
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^u-P-R+2Q (i_qf+R^^Q^ (9) 



where Qi = max {0, P + R — u) and = min (P, R). The population is more conveniently 
described by the set {n'pp} rather than by n". In fact, given the conditional probability 
distribution of {upp} is again a multinomial 

Vm Kh, <n, • ■ ■ , I O = „ , ff Kr • • • m:1- (10) 

^QR- ^IR- ■ ■ ■ ^uR- 

for i? = 0, . . . , z/. In this framework the frequency of molecules of type P in the next 
generation np(t + 1) is given simply by j^J2r^pr- This frequency is used to generate the 
new population of molecules of length z/ according to the distribution (|^). The procedure 
is then repeated again. 

We have run simulations for the single-sharp-peak replication landscape using the pro- 
cedure described above, which neglects linkage disequilibrium at the population level, as 



well as the standard genetic algorithm |]I8[, in which the correlations between consecutive 
generations are maintained. We have focused on the effect of the error rate 1 — g on the 
normalized mean Hamming distance d between the master string and the whole population 
in the stationary regime. This quantity is given by the fraction of monomers in the entire 
population, i.e., d = ■^I]p('^ — P)^p- In Figs. (|I]) and (0) we present the results of the 
simulations for d and its standard deviation cr, respectively, as functions of the error rate 
1 — q. The initial population is set with 11,^ = 1 and Up = for P ^ u, and it is left to evolve 
for 2 10^ generations. No significant differences were found for longer runs or for different 



choices of the initial molecular frequencies. Each data point involves two kinds of average: 
for each run we average over the mean Hamming distance in the last 100 generations; this 
value is then averaged over 200 runs. We note that even if the populations are identical 
in the initial generation, the random character of the transitions n ^ n' ^ n" will make 
them distinct in the next generation. It is clear from these results that, as N increases, the 
quantitative effects of assumption (^) become less significant. Moreover, the dependence of 
d and cr on the error rate is qualitatively the same for both algorithms. 



To derive an analytical recursion relation for the average molecular frequencies Ilp{t) we 
consider the following approximate procedure, akin to the annealed approximation of the 
statistical mechanics of disordered systems, which facilitates greatly the analysis: instead 
of averaging over the populations only after the stationary regime is reached, we perform 
this average in each generation. The result obtained Ilp{t) is then used to build the new 
populations. Of course, in doing so we neglect the fluctuations of Hpit) for the different 
runs. Within this framework the average frequency of molecules of type P in generation 
t + 1 is written as 



III. RECURSION EQUATIONS 



Up{t + 1) 



1 
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Noting that ^pMpn = 1 and J2rWr (n) = 1, we can easily verify that the normahzation 
condition YliP^p{t + 1) = 1 is satisfied. To proceed further we must specify the rephcation 
rate Ap. In the case of the single-sharp-peak replication landscape, i.e., = a and Ap = 1 
for P 7^ z/, the summations over nQ, . . . , n^^i can be readily carried out. The final result is 

Efi=onH(t) \Mpr + a^Mp, 



Y[p{t + 1) = Mp 
for P = 0, . . . , I/. Here 

Bn - 



N 



N-l 



n,(t) + ^ p„- 



n=0 



H-r (a - 1) 



(15) 



(n-i\ 



V 



n 



li 

n,(t) 1 - n,(t) 



(16) 



and r = n/N. Thus, given the initial average molecular frequencies Ilp{t = 0) for P = 
0, . . . , I/, equations (|T^) are iterated till the stationary regime is reached. 

Before we proceed on the analysis of the stationary solutions of the recursion equations 



(|T^), some comments regarding the definition of the error threshold are in order. A popular 
definition of error threshold is the error rate at which the master frequency vanishes 
imj^. The problem with this definition is that, even in the deterministic limit oo, U^y 

never vanishes for finite v. The vanishing of the master frequency is an artifact of neglecting 
reverse mutations which can be justified in the limit v ^ oo only. In particular, for the 
single-sharp-peak replication landscape this prescription yields a very simple equation for 
the replication accuracy at the threshold in the deterministic regime 0,0], 

1 



Ingt 



In a. 



(17) 



We must emphasize that for finite v this equation is an approximation only. A more ap- 
propriate definition of the error threshold, which is useful for the finite A^ well, is 



obtained by considering the statistical properties of the entire molecular population [|19 



In particular, we focus on the normalized mean Hamming distance d between the master 
sequence and the entire population and define the error threshold as the error rate at which 



the standard deviation a is maximal ||T9| 
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In Figs, m and |^ we show the theoretical predictions for d and a using the steady-state 
solution of the recursion equations (|I5|). As expected, the effects of the fluctuations in 
Ilp{t) for the different runs are stronger for small and hence our analytical approximation 
yields very poor results in this case, although it reproduces quite well the qualitative behavior 
pattern of the quantities measured. However, already for = 100 there is a good agreement 
between the theoretical predictions for d and the simulation results, provided that 1 — g is 
not too near the threshold transition. Rather surprisingly, that agreement is better for 
the standard genetic algorithm. As expected, however, the theoretical predictions for the 
standard deviation a are very poor since our approximation scheme neglects the fluctuations 
in rip between the different runs, which are directly measured by cr. Nevertheless, the 
qualitative features of the simulation results are again well described by the theoretical 
curves. We note, in particular, the abrupt increase of a as the error threshold is approached 
from below and the slow decay as the error rate increases further. The agreement between 
theory and simulation becomes better as A^ increases. In Fig. ^ we show the replication 
accuracy at the threshold function of the reciprocal of the population size. The 

increase of qt with decreasing A^ is expected since the fluctuations become stronger for 
small A^ and so the replication must be more accurate in order to keep the master string 
in the population. In particular, for large A^ we find that qt increases linearly with 1/A^. 
This result is in disagreement with the predictions of the birth and death model of error 
threshold proposed by Nowak and Schuster, which predicts that qt increases with 1/\/N for 
large A^ [Q. We note that, despite the claim of those authors, it is not possible to discern 
whether qt increases with 1/A^ or from their numerical data obtained using Gillespie's 



algorithm [14 



Another interesting phenomenon, termed stochastic escape, is the loss of the master 
string in a finite population [pO|-p2|. In the limit u oo this loss becomes irreversible. 



since no reverse mutation will be able to restore the master string. We can easily derive a 
lower bound to the probability that the master string is absent from the population using 
the inequahty 



l-n^< Pr{n^ = 0}, (18) 

which follows trivially from the fact that n^, > 0. Using n^, = A^n,^, we can find the 
replication accuracy qi such that the condition Yl^ = 1/N is satisfied for fixed u and a. 
Clearly, for q < qi the probability that the master string is absent from the population is 
nonzero. However, since this probability may be nonzero for q > qi a.s well, qi gives only 
a lower bound to the value of the replication accuracy below which the stochastic escape 
phenomenon actually takes place. This bound is presented in Fig. |^ as a function of the 
reciprocal of the population size. In the limit ^ oo we find qi —>■ for finite u, since in 
the deterministic regime flu is bounded by 1/2*^ > 0. It is interesting that for not too 
large we find qi > qt so that the master string is likely to be absent from the quasispecies 
for replication accuracies in the range qt < q < qi- 

We turn now to the analysis of the deterministic regime, — » oo. In this case, the sum 
in Eq. (0) is dominated by the closest integer to (A^ — l)Il^{t), so that r llu{t) and the 
recursion equations ( pTSD reduce to 

^ E^-JoMp,n,(t) + aMp.n.(t) 

^ ^ l + n,(t)(a-l) ^ ^ 

for P = 0, . . . , L. 

To appreciate the relevance of the present formulation of the quasispecies model, we 
compare it with the exact solution of the kinetic equations (P for finite u. In fact, as 



pointed out by Swetina and Schuster |2^, for the type of replication landscape considered 
in this paper, the 2'^ molecular concentrations Xi can also be grouped into z/ + 1 distinct 
classes according to the number of monomers 1 that compose the molecules. This procedure 
allows the description of the chemical kinetics by only z/ + 1 coupled first-order differential 
equations. In particular, for the single-sharp peak landscape the concentrations of molecules 



in class P = 0, . . . , z/, denoted by Yp with I]p = 1, obey the differential equations |23 



dY, 



p 



u-l 



dt 



J2 MpR Yn + aY.Mp. - ^ [1 + ^ (a - 1)] . (20) 
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It is clear then that both models (|T9D and ( pOD possess the same stationary state. This very 
interesting finding indicates that in the quasispecies model there is no linkage disequilibrium 
at the population level in the stationary regime, i.e., the population is a random combination 
of the constituent molecules. In fact, this results holds true for any choice of the replication 
landscape Ap, as can be easily verified by taking the limit — oo in Eq. 

It is also interesting to compare the deterministic limit of our model (|T9|) with the 
population genetics approach to the deterministic quasispecies model [0 . This can easily be 
done by writing down a recursion equation for the frequency of monomers 1 in the population 
Pt = lEpP'^pit), namely, 

1 + (a - 1) Il^[t) 



Clearly, this equation is useless since one must solve the general recursion equations (|T9|) 
in order to find Ilu{t). However, the population genetics formulation makes use of the 
binomial assumption (|^) to set Ilu{t) = p\ so that the recursion equation (^) will involve 
the monomer frequencies only 0. 

In Fig. § we present the logarithm of the replication accuracy at the threshold In qt as 
a function of the logarithm of the selective advantage In a for several values of v for the 
deterministic case. There is a good agreement with (|1^ for a ~ 1 and, as expected, this 
agreement becomes better as v increases. We have also verified that the prediction of the 
population genetics approach yields a very poor approximation for the location of the 
error threshold. Furthermore, we have verified that Hp departs significantly from a binomial 
distribution only near the threshold transition. Aside this region, the population genetics 
approach provides a reliable and concise description of the deterministic quasispecies model. 



IV. CONCLUSION 

In this paper we have proposed a simple analytical model, based on the neglect of linkage 
disequilibrium, to study the error propagation in the quasispecies evolution of finite popu- 
lations. In particular, our finding that in the deterministic regime this model yields exactly 
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the same stationary state of the original kinetics model ||2^ implies that the steady-state 



molecular population of the deterministic quasispecies model is a random assembly of the 
component molecules. 

Some comments regarding the comparison of our approach with previous population 
theoretical analyses of the finite quasispecies model [^,0 are in order. These works 
provide approximate formalisms to study the evolution of finite populations on a multiplica- 
tive single-peak fitness landscape without neglecting linkage disequilibrium. Interestingly, 
for this fitness landscape, which is given by Ap = (1 — aY^^ with < a < 1, the binomial 



assumption (H) yields the exact solution to the deterministic equations for lip |21|]. More- 
over, in the weak selection limit {a ^ 1) the corrections to the deterministic value of 
the mean Hamming distance between the master sequence and the whole population can 
be calculated analytically [^]. An alternative formalism concentrates on the evolution of 
the ensemble average of the first cumulants of the distribution of fitness in the population 



2J]. It is not clear, however, whether these quantities can be related to the more natural 
measures of the population composition, namely, d and a. We note that the location of the 
error threshold is not addressed in these works. 

An important open question, which can be answered through intensive numerical simula- 
tions only, is the dependence of the replication accuracy at the threshold on the population 
size for large populations. In fact, the numerical data existent in the literature do not allow 
to distinguish between the 1/A^ dependence predicted by our model and the 1/VN depen- 
dence predicted by the birth and dead model JI^. We think that simulations based on 
genetic algorithms rather than on Gillespie's algorithm may prove more effective to address 
this issue. 
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FIGURES 

FIG. 1. Steady-state normalized mean Hamming distance between the master sequence and the 
whole population as a function of the error rate per digit for = 10 (v); ^ = 100 (O)- The 
full symbols are the results obtained with the algorithm that neglects linkage disequilibrium, while 
the empty symbols are the results obtained with the standard genetic algorithm. The theoretical 
prediction is given by the solid curves. The dashed line is the prediction for oo. The 

parameters are u = 10 and a = 10. 

FIG. 2. Steady-state standard deviation of the normalized mean Hamming distance between 
the master sequence and the whole population as a function of the error rate per digit. The 
parameters and convention are the same as for Fig. (||). 

FIG. 3. Replication accuracy at the error threshold qt (solid curves) and lower bound to the 
replication accuracy below which the stochastic escape phenomenon occurs qi (dashed curves) 
as functions of the reciprocal of the population size for a = 10 and (from bottom to top) 
= 6, 8, . . . , 20. 

FIG. 4. Logarithm of the replication accuracy at the threshold in the deterministic regime as 
a function of the logarithm of the selective advantage for (from top to bottom ) = 5, 10, 15, 20 
and 25. The solid curves are obtained using Eq. (|l9| ) and the dashed straight lines are given by 
Eq. (|l3). 
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